no_tasks = 2
seed = 9384
outcome = 'i_drug'
fbase = '/scratch/p_gaydosh_lab/DSI/'
results_directory <- str_c(fbase, outcome, '/results_full_h2o_2_deleted')
set.seed(seed)
Purpose. In this work, we explore the relation between identified measures of despair of interest (e.g., personality measures of self-consciousness, individual and composite item scores from the CES-D assessment) and descriptors of diseases of despair. We achieve this goal by modeling the outcomes based on the included predictors, and robustly assessing the importance of the included features via bootstrapping. We use two well-known machine learning models, random forests and LASSO, both of which are frequently used to measure the relative importance of the predictors included in the models. Lastly, we generate trained and tuned models using this reduced feature set, which can be used by others who wish to predict the identified outcomes.
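The bootstrapped-importance idea described above can be illustrated with a minimal sketch. This assumes the rsample and ranger packages and toy data; the actual pipeline uses its own helper functions.

```r
library(rsample)  # bootstrap resampling
library(ranger)   # fast random forests

# Toy data standing in for the real predictor/outcome set
set.seed(9384)
df <- data.frame(y  = factor(rbinom(200, 1, 0.3)),
                 x1 = rnorm(200),
                 x2 = rnorm(200))

# Fit a forest on each bootstrap resample and collect permutation importances
boots <- bootstraps(df, times = 25)
imp <- sapply(boots$splits, function(s) {
  fit <- ranger(y ~ ., data = analysis(s), importance = "permutation")
  fit$variable.importance
})

# Aggregate: median importance of each predictor across resamples
apply(imp, 1, median)
```

Aggregating importances across resamples, rather than trusting a single fit, is what makes the estimates robust.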
Subject inclusion. For this investigation, we omit the entirety of Wave 2. This is commonly done in analyses of Add Health data due to the design of the original study. Otherwise, our dataset includes only subjects who have predictor and outcome data in all of the waves.
Outcome variables. In this experiment, we assess illicit drug use (the i_drug outcome) at Wave 5.
Predictor variables. The predictors for these models are hand-picked, and based on previous work, relevance, and subject matter expertise. The set of predictors and the set of outcomes are disjoint. Predictors from Waves 1-4 (excluding Wave 2, see above) are included, and will be detailed in the following analysis.
The predictors we will be using are given by the variable predictor_list, loaded from the 10-import-data.Rmd file. This initial set of predictors is based on the list of variables that describe anxiety, depression, and optimism.
## set outcome variable of interest
filebase = '/scratch/p_gaydosh_lab'
#create data in specified form
dataset_list <- generate_datasets(outcome, binarize=FALSE, filebase=filebase, seed_val=seed)
Parsed with column specification:
cols(
  variable = col_character(),
  wave = col_double(),
  na_level = col_double(),
  type = col_character()
)
UNRELIABLE VALUE: Future (‘<none>’) unexpectedly generated random numbers without specifying argument '[future.]seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify argument '[future.]seed', e.g. 'seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use [future].seed=NULL, or set option 'future.rng.onMisuse' to "ignore".
0 subjects removed from dataset.
[1] "Recoding Missing Factor Variables"
[1] "Number of factor variables being recoded : 98"
[1] "Factor variables being recoded : "
[1] "h1fs1" "h1fs3" "h1fs4" "h1fs5" "h1fs6" "h1fs7" "h1fs11" "h1fs15" "h1fs16" "h1fs17"
[11] "h1fs19" "h1ee14" "h1ee12" "h1ds14" "h1ed9" "h1ed7" "h1ds3" "h1fv7" "h1fv1" "h1fv8"
[21] "h1jo9" "h1ds13" "h1ds12" "h1ds11" "h3sp5" "h3sp6" "h3sp7" "h3sp8" "h3sp9" "h3sp10"
[31] "h3sp11" "h3sp12" "h3sp13" "h3ec56" "h3ds18h" "h3ds18a" "h3ds18i" "h3to49" "h3ds6" "h3ds5"
[41] "h3ds4" "h4id5h" "h4mh18" "h4mh19" "h4mh20" "h4mh21" "h4mh22" "h4mh23" "h4mh24" "h4mh25"
[51] "h4mh26" "h4mh27" "h4id5j" "h4pe6" "h4pe14" "h4pe22" "h4pe30" "h4pe7" "h4pe15" "h4pe23"
[61] "h4pe31" "h4mh3" "h4mh4" "h4mh5" "h4mh6" "h4mh2" "hdl" "ldl" "tg" "h4bpcls"
[71] "h4ds7" "h4ds19" "h4ds14" "h4ds20" "h4ds6" "h4ds5" "h4ds4" "h4cj17" "h5id6g" "h5ss0a"
[81] "h5ss0b" "h5ss0c" "h5ss0d" "h5ss0e" "h5id6i" "h5pe1" "h5pe2" "h5pe3" "h5mn1" "h5mn2"
[91] "h5mn3" "h5mn4" "h5cj1d" "h5cj1e" "h5cj1f" "h5cj1b" "h5cj1c" "h5cj1a"
h1fs1 h1fs3 h1fs4 h1fs5 h1fs6 h1fs7 h1fs11 h1fs15 h1fs16 h1fs17 h1fs19 h1ee14
"factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor"
h1ee12 h1ds14 h1ed9 h1ed7 h1ds3 h1fv7 h1fv1 h1fv8 h1jo9 h1ds13 h1ds12 h1ds11
"factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor"
h3sp5 h3sp6 h3sp7 h3sp8 h3sp9 h3sp10 h3sp11 h3sp12 h3sp13 h3ec56 h3ds18h h3ds18a
"factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor"
h3ds18i h3to49 h3ds6 h3ds5 h3ds4 h4id5h h4mh18 h4mh19 h4mh20 h4mh21 h4mh22 h4mh23
"factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor"
h4mh24 h4mh25 h4mh26 h4mh27 h4id5j h4pe6 h4pe14 h4pe22 h4pe30 h4pe7 h4pe15 h4pe23
"factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor"
h4pe31 h4mh3 h4mh4 h4mh5 h4mh6 h4mh2 hdl ldl tg h4bpcls h4ds7 h4ds19
"factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor"
h4ds14 h4ds20 h4ds6 h4ds5 h4ds4 h4cj17 h5id6g h5ss0a h5ss0b h5ss0c h5ss0d h5ss0e
"factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor"
h5id6i h5pe1 h5pe2 h5pe3 h5mn1 h5mn2 h5mn3 h5mn4 h5cj1d h5cj1e h5cj1f h5cj1b
"factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor" "factor"
h5cj1c h5cj1a
"factor" "factor"
[1] "Recoding Missing Numeric Variables"
[1] "Number of numeric variables being recoded : 7"
[1] "Numeric variables being recoded : "
[1] "h1ed2" "h4bmi" "hba1c" "crp" "h4waist" "h4sbp" "h4dbp"
#parse out dataset components
wave_data <- dataset_list$wave_data
full_dataset <- dataset_list$full_dataset
ds_raw <- dataset_list$ds_raw_outcome
ds <- dataset_list$ds_final
#ml splits of the data
training_df <- dataset_list$training_df
validation_df <- dataset_list$validation_df
testing_df <- dataset_list$testing_df
The following table details the values present in the outcome variable. As we can see, we need to convert the variable into a factor, and drop the NAs.
ds_raw %>%
group_by(.data[[outcome]]) %>%
summarise(total = n(), type = class(.data[[outcome]]))

| i_drug <dbl> | total <int> | type <chr> |
|---|---|---|
| 0 | 8930 | numeric |
| 1 | 324 | numeric |
| NA | 95 | numeric |
ds %>%
group_by(.data[[outcome]]) %>%
summarise(total = n(), type = class(.data[[outcome]]))

| i_drug <fctr> | total <int> | type <chr> |
|---|---|---|
| 0 | 8930 | factor |
| 1 | 324 | factor |
print(str_c('Total number of subjects in unprocessed outcome data: ', nrow(ds_raw)))
[1] "Total number of subjects in unprocessed outcome data: 9349"
print(str_c('Total number of subjects in processed outcome data: ', nrow(ds)))
[1] "Total number of subjects in processed outcome data: 9254"
print(str_c('Total subjects dropped due to NA or skip: ', nrow(ds_raw)-nrow(ds)))
[1] "Total subjects dropped due to NA or skip: 95"
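The conversion and NA handling above happen inside generate_datasets(); a minimal dplyr sketch of the same recoding (hypothetical, for illustration only) is:

```r
library(dplyr)

# Drop subjects with a missing outcome, then convert the outcome to a factor
ds_sketch <- ds_raw %>%
  filter(!is.na(.data[[outcome]])) %>%
  mutate(across(all_of(outcome), as.factor))
```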
The table also demonstrates that the classes are very imbalanced, with roughly 28 times as many subjects in the negative class as in the positive class (8930 vs. 324). This is displayed graphically for easy consumption below.
ds_raw %>%
explore_outcome(ds, outcome)
The original raw outcome data is displayed by converting it to a factor. For more information about its actual type, investigate the ds_raw dataframe. It is most likely a double.
After dropping the NAs, we now have 9254 rows, which is consistent with the table above. The distribution of the data shows a distinct imbalance, as noted above, which warrants using pr_auc as the optimization and selection metric.
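If pr_auc is computed with yardstick (an assumption; the pipeline may use its own metric helpers), the call looks like the following sketch, with a hypothetical preds tibble:

```r
library(tibble)
library(yardstick)

# Hypothetical predictions: truth factor plus predicted probability of class "1"
preds <- tibble(
  truth   = factor(c(0, 0, 1, 1, 0, 1), levels = c(0, 1)),
  .pred_1 = c(0.10, 0.40, 0.35, 0.80, 0.20, 0.60)
)

# "1" is the second factor level, so mark it as the event of interest
pr_auc(preds, truth = truth, .pred_1, event_level = "second")
```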
Here, we comment on the general characteristics of the data based on the provided visualizations: missingness, any strange or unusual behavior (e.g., strong imbalances), and any correlations that stand out.
#Report about the characteristics of the subjects left out of the join
ds %>% explore_dropped()
# Visualize distributions of variables of interest
ds %>%
dplyr::select(-aid) %>%
graph_bar_discrete(df = .,
plot_title = "Distributions of Discrete Variables",
max_categories = 50,
num_rows = 3,
num_cols = 3,
x_axis_size = 12,
y_axis_size = 12,
title_size = 15)
ds %>%
graph_missing(only_missing = TRUE,
title = "Percent Missing",
box_line_size = .5,
label_size = .5,
x_axis_size = 12,
y_axis_size = 12,
title_size = 15)
ds %>%
#dplyr::select(1:20) %>%
pairwise_cramers_v() %>%
plot_cramer_v(x_axis_angle = 90,
plot_title = "Association among Categorical Variables",
interactive = TRUE)
The correlation plot suggests that there are several variables we might consider removing. First, there are moderate correlations within the individual predictor blocks; this follows from the way they are ordered, since they are essentially grouped into subsets. Additionally, some predictor pairs have extremely high correlations, like (h4id5j, h4mh26) (correlation of 0.72), and many variables in the h4mh* series. The pair (h4id5j, h4id5h) also has a correlation above 0.5. We may want to consider removing several of these, in addition to the age variables, because they may cause feature-importance masking within our approaches.
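For reference, the statistic behind pairwise_cramers_v() can be sketched for a single pair of factors (a minimal implementation, not the pipeline's helper):

```r
# Cramér's V: chi-squared association between two factors, rescaled to [0, 1]
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  k    <- min(nrow(tab), ncol(tab))
  as.numeric(sqrt(chi2 / (sum(tab) * (k - 1))))
}
```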
The following table displays the mean performance metrics for the bootstrapped models on the validation set, excluding NA values.
mean_bs_rf_perf <- bs_rf_perf %>%
summarise_if(is.numeric, mean, na.rm=TRUE) %>%
mutate(model = 'bs_rf') %>%
dplyr::select(model, everything())
mean_bs_rf_perf

| model <chr> | accuracy <dbl> | mpce <dbl> | sens <dbl> | spec <dbl> | ppv <dbl> | npv <dbl> | roc_auc <dbl> | pr_auc <dbl> | tns <dbl> |
|---|---|---|---|---|---|---|---|---|---|
| bs_rf | 0.9647459 | 0.4922064 | 0.016875 | 0.9987122 | 0.3197822 | 0.9659289 | 0.8116213 | 0.1732577 | 1783.7 |
As shown, the bootstrapped models tend to have high specificity but low sensitivity, indicating a challenge in identifying subjects in the positive class.
boot_rf_mdi <- list(bs_rf_mdi) %>%
get_median_placement(use_base_var = TRUE) %>%
add_attribute_names('predictor', full_dataset) %>%
dplyr::select(predictor, att_name, overall_rank)
head(boot_rf_mdi, 20)

| predictor <chr> | att_name <chr> | overall_rank <int> |
|---|---|---|
| h4dbp | S27 DIASTOLIC BLOOD PRESSURE -W4 | 1 |
| age_w5 | age_w5 | 2 |
| h4sbp | S27 SYSTOLIC BLOOD PRESSURE-W4 | 3 |
| h4bmi | S27 BMI -W4 | 4 |
| h4waist | S27 MEASURED WAIST (CM) -W4 | 5 |
| h5cj1c | S14Q1C 12 MO SELL DRUGS-W5 | 6 |
| crp | crp | 7 |
| one_race5 | one_race5 | 8 |
| h1ed2 | S5Q2 FREQ-SKIPPED SCHOOL-W1 | 9 |
| h3bmi | h3bmi | 10 |
This table shows the median MDI variable-importance ranks aggregated across the bootstrapped models.
# Needs to be fixed so that axes don't overlap each other and obscure understanding
plot_placement_boxplot(list(bs_rf_mdi))
Now, let’s look at the permutation importance:
met <- 'pr_auc'
bs_rf_perm <- bs_rf_perm_plt %>%
get_permute_placement(metric_oi=met) %>%
add_attribute_names('predictor', full_dataset) %>%
dplyr::select(predictor, everything())
head(bs_rf_perm, 20)

| predictor <chr> | att_name <chr> | overall_rank <int> | pr_auc <dbl> |
|---|---|---|---|
| h5cj1c | S14Q1C 12 MO SELL DRUGS-W5 | 1 | 0.029647167 |
| h4ds5 | S21Q5 12 MO,OFT SELL DRUGS-W4 | 2 | 0.010081669 |
| h4cj17 | S22Q17 EVER INCARCERATED-W4 | 3 | 0.009305191 |
| h3ds5 | S26Q5 12 MO,OFT SELL DRUGS-W3 | 4 | 0.005107186 |
| h5ss0b | S10Q0B PAST 7 DAY DEPRESSED-W5 | 5 | 0.004402274 |
| h4ds6 | S21Q6 12 MO,OFT STEAL SOMETHING/<$50-W4 | 6 | 0.004022309 |
| h3to49 | S28Q49 SINCE JUN95 HAVE DRIVEN DRUNK-W3 | 7 | 0.003096702 |
| h5cj1d | S14Q1D 12 MO GET IN PHYS FIGHT-W5 | 8 | 0.003067237 |
| h5ss0a | S10Q0A PAST 7 DAY SHAKE BLUE-W5 | 9 | 0.002639547 |
| h4ds7 | S21Q7 12 MO,OFT PART PHYS FIGHT/GRP-W4 | 10 | 0.002618647 |
In this step, we assess the differences between the two types of importance rankings.
cbind(boot_rf_mdi[1:20,], dplyr::select(bs_rf_perm[1:20,], -all_of(met)))

| predictor <chr> | att_name <chr> | overall_rank <int> | predictor <chr> |
|---|---|---|---|
| h4dbp | S27 DIASTOLIC BLOOD PRESSURE -W4 | 1 | h5cj1c |
| age_w5 | age_w5 | 2 | h4ds5 |
| h4sbp | S27 SYSTOLIC BLOOD PRESSURE-W4 | 3 | h4cj17 |
| h4bmi | S27 BMI -W4 | 4 | h3ds5 |
| h4waist | S27 MEASURED WAIST (CM) -W4 | 5 | h5ss0b |
| h5cj1c | S14Q1C 12 MO SELL DRUGS-W5 | 6 | h4ds6 |
| crp | crp | 7 | h3to49 |
| one_race5 | one_race5 | 8 | h5cj1d |
| h1ed2 | S5Q2 FREQ-SKIPPED SCHOOL-W1 | 9 | h5ss0a |
| h3bmi | h3bmi | 10 | h4ds7 |
As shown, MDI importance is biased by the number of distinct values associated with a predictor. Because the wave ages take many more values than the other factors, their importance is artificially inflated under MDI. The permutation importance is more intuitive.
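Permutation importance can be sketched in a few lines: shuffle one predictor at a time and record the drop in the chosen metric. The names here (permute_importance, metric_fn) are illustrative, not the pipeline's helpers.

```r
permute_importance <- function(fit, df, outcome, metric_fn) {
  base <- metric_fn(fit, df)  # metric on intact data
  sapply(setdiff(names(df), outcome), function(v) {
    df_perm <- df
    df_perm[[v]] <- sample(df_perm[[v]])  # break the predictor-outcome link
    base - metric_fn(fit, df_perm)        # importance = performance drop
  })
}
```

Because shuffling a predictor only hurts performance to the extent the model actually relies on it, this measure does not share MDI's bias toward high-cardinality variables.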
plot_permute_var_imp(bs_rf_perm, metric = met)
In this step, we model the relation between the outcomes and the predictors using a regression model with L1 (LASSO) regularization. This drives the coefficients of unimportant and redundant features towards zero.
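A glmnet sketch of such a fit (an assumption; the pipeline's LASSO helper may differ, and note that model.matrix() silently drops rows containing NAs):

```r
library(glmnet)

# Design matrix and outcome; alpha = 1 selects the L1 (LASSO) penalty
X <- model.matrix(~ . - 1, data = dplyr::select(training_df, -aid, -all_of(outcome)))
y <- training_df[[outcome]]

cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cvfit, s = "lambda.min")  # sparse vector; zeros mark dropped predictors
```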
mean_bs_lasso_perf <- bs_lasso_perf %>%
summarise_if(is.numeric, mean, na.rm=TRUE) %>%
mutate(model='bs_lasso') %>%
dplyr::select(model, everything())
mean_bs_lasso_perf

| model <chr> | accuracy <dbl> | mpce <dbl> | sens <dbl> | spec <dbl> | ppv <dbl> | npv <dbl> | roc_auc <dbl> | pr_auc <dbl> | tns <dbl> |
|---|---|---|---|---|---|---|---|---|---|
| bs_lasso | 0.9434054 | 0.4101568 | 0.21 | 0.9696865 | 0.2020515 | 0.9716446 | 0.722135 | 0.1371521 | 1731.86 |
boot_lasso_mdi <- list(bs_lasso_mdi) %>%
get_median_placement(use_base_var = TRUE) %>%
add_attribute_names('predictor', full_dataset) %>%
dplyr::select(predictor, att_name, overall_rank)
head(boot_lasso_mdi, 20)

| predictor <chr> | att_name <chr> | overall_rank <int> |
|---|---|---|
| one_race5 | one_race5 | 1 |
| age_w5 | age_w5 | 2 |
| bio_sex | BIOLOGICAL SEX-W1 | 3 |
| h4cj17 | S22Q17 EVER INCARCERATED-W4 | 4 |
| h4dbp | S27 DIASTOLIC BLOOD PRESSURE -W4 | 5 |
| h1ed2 | S5Q2 FREQ-SKIPPED SCHOOL-W1 | 6 |
| h4bmi | S27 BMI -W4 | 7 |
| h1ds3 | S29Q3 LIE TO PARENTS ABOUT WHEREABOUT-W1 | 8 |
| h3sp7 | S12Q7 PAST 7 DAYS FELT AS GOOD AS OTH-W3 | 9 |
| h5cj2b | S14Q2B 12 MO SAW ONE SHOOT/STAB PERS-W5 | 10 |
plot_placement_boxplot(list(bs_lasso_mdi))
bs_lasso_perm <- bs_lasso_perm_plt %>%
get_permute_placement(metric_oi=met) %>% #set in random forest section
add_attribute_names('predictor', full_dataset) %>%
dplyr::select(predictor, everything())
head(bs_lasso_perm, 20)

| predictor <chr> | att_name <chr> | overall_rank <int> | pr_auc <dbl> |
|---|---|---|---|
| h5cj1c | S14Q1C 12 MO SELL DRUGS-W5 | 1 | 0.054165390 |
| h4ds5 | S21Q5 12 MO,OFT SELL DRUGS-W4 | 2 | 0.012534262 |
| h4mh2 | S14Q2 HOW OFTEN FEEL ISOLATED-W4 | 3 | 0.011379094 |
| h4mh24 | S14Q24 PAST 7 DAYS FELT HAPPY-W4 | 4 | 0.011105397 |
| h4ds6 | S21Q6 12 MO,OFT STEAL SOMETHING/<$50-W4 | 5 | 0.010869625 |
| one_race5 | one_race5 | 6 | 0.009682284 |
| h5pe1 | S9Q1 ALWAYS OPTIM ABOUT FUT-W5 | 7 | 0.009617185 |
| h5mn3 | S13Q3 LAST MO FELT THINGS GO YOUR WAY-W5 | 8 | 0.009388152 |
| h3to49 | S28Q49 SINCE JUN95 HAVE DRIVEN DRUNK-W3 | 9 | 0.008662670 |
| hdl | hdl | 10 | 0.008567205 |
plot_permute_var_imp(bs_lasso_perm, metric = met)
Now, we compare the feature importances generated by the two different approaches. The traditional method of evaluating feature importance for regression methods is through analysis of the coefficients.
cbind(boot_lasso_mdi[1:20,], dplyr::select(bs_lasso_perm[1:20,], -all_of(met)))

| predictor <chr> | att_name <chr> | overall_rank <int> | predictor <chr> |
|---|---|---|---|
| one_race5 | one_race5 | 1 | h5cj1c |
| age_w5 | age_w5 | 2 | h4ds5 |
| bio_sex | BIOLOGICAL SEX-W1 | 3 | h4mh2 |
| h4cj17 | S22Q17 EVER INCARCERATED-W4 | 4 | h4mh24 |
| h4dbp | S27 DIASTOLIC BLOOD PRESSURE -W4 | 5 | h4ds6 |
| h1ed2 | S5Q2 FREQ-SKIPPED SCHOOL-W1 | 6 | one_race5 |
| h4bmi | S27 BMI -W4 | 7 | h5pe1 |
| h1ds3 | S29Q3 LIE TO PARENTS ABOUT WHEREABOUT-W1 | 8 | h5mn3 |
| h3sp7 | S12Q7 PAST 7 DAYS FELT AS GOOD AS OTH-W3 | 9 | h3to49 |
| h5cj2b | S14Q2B 12 MO SAW ONE SHOOT/STAB PERS-W5 | 10 | hdl |
The following table compares the mean performance of bootstrapped random forests to the mean performance of bootstrapped LASSO methods.
bs_comp_perfs <- bind_rows(mean_bs_rf_perf, mean_bs_lasso_perf)
bs_comp_perfs

| model <chr> | accuracy <dbl> | mpce <dbl> | sens <dbl> | spec <dbl> |
|---|---|---|---|---|
| bs_rf | 0.9647459 | 0.4922064 | 0.016875 | 0.9987122 |
| bs_lasso | 0.9434054 | 0.4101568 | 0.210000 | 0.9696865 |
Here, we look at the aggregated results of the bootstrapped predictors and compare the models generated to each other.
joined_results <- bs_rf_perm %>%
dplyr::select(-all_of(met)) %>%
full_join(dplyr::select(bs_lasso_perm, -all_of(met)), by=c("predictor", "att_name"), suffix=c('.rf', '.lasso')) %>%
mutate(mean_rank = (overall_rank.rf+overall_rank.lasso)/2) %>%
arrange(mean_rank)
head(joined_results, 20)

| predictor <chr> | att_name <chr> | overall_rank.rf <int> | overall_rank.lasso <int> | mean_rank <dbl> |
|---|---|---|---|---|
| h5cj1c | S14Q1C 12 MO SELL DRUGS-W5 | 1 | 1 | 1.0 |
| h4ds5 | S21Q5 12 MO,OFT SELL DRUGS-W4 | 2 | 2 | 2.0 |
| h4ds6 | S21Q6 12 MO,OFT STEAL SOMETHING/<$50-W4 | 6 | 5 | 5.5 |
| h3to49 | S28Q49 SINCE JUN95 HAVE DRIVEN DRUNK-W3 | 7 | 9 | 8.0 |
| h4mh2 | S14Q2 HOW OFTEN FEEL ISOLATED-W4 | 14 | 3 | 8.5 |
| one_race5 | one_race5 | 12 | 6 | 9.0 |
| h5cj1d | S14Q1D 12 MO GET IN PHYS FIGHT-W5 | 8 | 11 | 9.5 |
| h3ds5 | S26Q5 12 MO,OFT SELL DRUGS-W3 | 4 | 16 | 10.0 |
| h5mn3 | S13Q3 LAST MO FELT THINGS GO YOUR WAY-W5 | 15 | 8 | 11.5 |
| h5pe1 | S9Q1 ALWAYS OPTIM ABOUT FUT-W5 | 16 | 7 | 11.5 |
The following visualization provides intuition about the differences in rankings between model types. Variables are ordered by overall mean importance, and for a given variable, the difference in rank is shown.
# Comparison of top_n features
joined_results %>%
compare_feature_select(interactive = TRUE,
top_n = 100,
opacity = 0.50,
plot_title = "Permutation Importance of Predictors by Model")
In this step, we build the final model for the random forest. We search a slightly larger grid of hyperparameter values to find the best model, keeping in mind the number of combinations required to evaluate the grid.
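The expanded grid might look like the following sketch (hypothetical values; the real grid lives in the array-task scripts):

```r
# Candidate hyperparameter combinations for the final random forest
rf_grid <- expand.grid(
  mtry          = c(5, 10, 20, 40),
  min_node_size = c(1, 5, 10),
  sample_frac   = c(0.6, 0.8, 1.0)
)
nrow(rf_grid)  # 36 combinations, each requiring a full fit
```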
final_rf_perf = NULL
# read file array
final_rf_perfs = load_from_csv(final_rf_perf, results_directory, no_tasks)
# get the index of the best performance and best performance
final_rfmodel_ind <- which.max(final_rf_perfs[[met]])
final_rf_perf <- final_rf_perfs %>% slice(final_rfmodel_ind)
# load the model of interest
final_model_rf <- load_best_model('final_rf_model', results_directory, final_rfmodel_ind)
The final random forest performance metrics are shown below:
# show model final performance
print(final_rf_perf)

| accuracy <dbl> | kap <dbl> | sens <dbl> | spec <dbl> | ppv <dbl> | npv <dbl> | mcc <dbl> | j_index <dbl> | bal_accuracy <dbl> |
|---|---|---|---|---|---|---|---|---|
| 0.9372973 | 0.2786264 | 0.40625 | 0.956327 | 0.25 | 0.978236 | 0.2876684 | 0.362577 | 0.6812885 |
final_rf_perm_plt = NULL
# read best perm plt
final_rf_perm_plt = load_from_csv(final_rf_perm_plt, results_directory, best_ind=final_rfmodel_ind)
final_rf_perm <- final_rf_perm_plt %>%
get_permute_placement(metric_oi=met) %>%
add_attribute_names('predictor', full_dataset) %>%
dplyr::select(predictor, everything())
head(final_rf_perm, 20)

| predictor <chr> | att_name <chr> | overall_rank <int> | pr_auc <dbl> |
|---|---|---|---|
| h4cj17 | S22Q17 EVER INCARCERATED-W4 | 1 | 0.06523748 |
| h5ss0b | S10Q0B PAST 7 DAY DEPRESSED-W5 | 2 | 0.04289668 |
| h5ss0a | S10Q0A PAST 7 DAY SHAKE BLUE-W5 | 3 | 0.04287412 |
| h3to49 | S28Q49 SINCE JUN95 HAVE DRIVEN DRUNK-W3 | 4 | 0.03083555 |
| h4mh2 | S14Q2 HOW OFTEN FEEL ISOLATED-W4 | 5 | 0.02819290 |
| h5ss0c | S10Q0C PAST 7 DAY HAPPY-W5 | 6 | 0.02400390 |
| h5mn3 | S13Q3 LAST MO FELT THINGS GO YOUR WAY-W5 | 7 | 0.02234086 |
| h1ed7 | S5Q7 RECEIVED OUT-OF-SCHL SUSPENSION-W1 | 8 | 0.02232680 |
| h4mh27 | S14Q27 PAST 7 DAYS FELT DISLIKED-W4 | 9 | 0.02160583 |
| h4dbp | S27 DIASTOLIC BLOOD PRESSURE -W4 | 10 | 0.02159901 |
plot_permute_var_imp(final_rf_perm, metric = met)
This section investigates the differences between the bootstrap results and the feature rankings from the final random forest model. The following table shows the overall differences in rank.
rf_joined_results <- final_rf_perm %>%
dplyr::select(-all_of(met)) %>%
full_join(dplyr::select(bs_rf_perm, -all_of(met)), by=c("predictor", "att_name"), suffix=c('.final', '.bootstrap')) %>%
mutate(mean_rank = (overall_rank.final + overall_rank.bootstrap)/2) %>%
arrange(mean_rank)
head(rf_joined_results, 20)

| predictor <chr> | att_name <chr> | overall_rank.final <int> | overall_rank.bootstrap <int> |
|---|---|---|---|
| h4cj17 | S22Q17 EVER INCARCERATED-W4 | 1 | 3 |
| h5ss0b | S10Q0B PAST 7 DAY DEPRESSED-W5 | 2 | 5 |
| h3to49 | S28Q49 SINCE JUN95 HAVE DRIVEN DRUNK-W3 | 4 | 7 |
| h5ss0a | S10Q0A PAST 7 DAY SHAKE BLUE-W5 | 3 | 9 |
| h4mh2 | S14Q2 HOW OFTEN FEEL ISOLATED-W4 | 5 | 14 |
| h5mn3 | S13Q3 LAST MO FELT THINGS GO YOUR WAY-W5 | 7 | 15 |
| h3ds5 | S26Q5 12 MO,OFT SELL DRUGS-W3 | 18 | 4 |
| h4dbp | S27 DIASTOLIC BLOOD PRESSURE -W4 | 10 | 13 |
| h4ds6 | S21Q6 12 MO,OFT STEAL SOMETHING/<$50-W4 | 22 | 6 |
| h5mn2 | S13Q2 LAST MO CONFID HANDLE PERS PBMS-W5 | 13 | 17 |
The following plot visualizes the differences between the final-model rankings and the bootstrap rankings.
# Comparison of top_n features
rf_joined_results %>%
compare_feature_select(sel_cols = c("overall_rank.final", "overall_rank.bootstrap"),
interactive = TRUE,
top_n = 100,
opacity = 0.50,
plot_title = "Permutation Importance of Predictors: Final vs. Bootstrap")
Now, we create the final model for LASSO. There is no substantial difference between this method and the bootstrap methods, other than the data upon which the model is built.
final_lasso_perf = NULL
# read file array
final_lasso_perfs = load_from_csv(final_lasso_perf, results_directory, no_tasks)
# get the index of the best performance and best performance
final_lassomodel_ind <- which.max(final_lasso_perfs[[met]])
final_lasso_perf <- final_lasso_perfs %>% slice(final_lassomodel_ind)
# load the model of interest
final_model_lasso <- load_best_model('final_lasso_model', results_directory, final_lassomodel_ind)
The final LASSO performance metrics are shown below:
# show model final performance
print(final_lasso_perf)

| accuracy <dbl> | kap <dbl> | sens <dbl> | spec <dbl> | ppv <dbl> |
|---|---|---|---|---|
| 0.9372973 | 0.1947225 | 0.265625 | 0.9613662 | 0.1976744 |
final_lasso_perm_plt = NULL
#load best index permutation from file
final_lasso_perm_plt = load_from_csv(final_lasso_perm_plt, results_directory, best_ind=final_lassomodel_ind)
final_lasso_perm <- final_lasso_perm_plt %>%
get_permute_placement(metric_oi=met) %>%
add_attribute_names('predictor', full_dataset) %>%
dplyr::select(predictor, everything())
head(final_lasso_perm, 20)

| predictor <chr> | att_name <chr> | overall_rank <int> | pr_auc <dbl> |
|---|---|---|---|
| h5cj1c | S14Q1C 12 MO SELL DRUGS-W5 | 1 | 0.055758799 |
| h3to49 | S28Q49 SINCE JUN95 HAVE DRIVEN DRUNK-W3 | 2 | 0.016530114 |
| h4mh2 | S14Q2 HOW OFTEN FEEL ISOLATED-W4 | 3 | 0.012293699 |
| h1ds3 | S29Q3 LIE TO PARENTS ABOUT WHEREABOUT-W1 | 4 | 0.012094270 |
| h4ds5 | S21Q5 12 MO,OFT SELL DRUGS-W4 | 5 | 0.011846507 |
| age_w5 | age_w5 | 6 | 0.010822169 |
| h4ds6 | S21Q6 12 MO,OFT STEAL SOMETHING/<$50-W4 | 7 | 0.010262974 |
| hdl | hdl | 8 | 0.006980847 |
| h4mh18 | S14Q18 PAST 7 DAYS BOTHERED BY THINGS-W4 | 9 | 0.006813132 |
| h4mh6 | S14Q6 LST MNTH DIFFICULTIES OVERWHELM-W4 | 10 | 0.006293066 |
plot_permute_var_imp(final_lasso_perm, metric = met)
This section investigates the differences between the bootstrap results and the feature rankings from the final LASSO model. The following table shows the overall differences in rank.
lasso_joined_results <- final_lasso_perm %>%
dplyr::select(-all_of(met)) %>%
full_join(dplyr::select(bs_lasso_perm, -all_of(met)), by=c("predictor", "att_name"), suffix=c('.final', '.bootstrap')) %>%
mutate(mean_rank = (overall_rank.final + overall_rank.bootstrap)/2) %>%
arrange(mean_rank)
head(lasso_joined_results, 20)

| predictor <chr> | att_name <chr> | overall_rank.final <int> | overall_rank.bootstrap <int> | mean_rank <dbl> |
|---|---|---|---|---|
| h5cj1c | S14Q1C 12 MO SELL DRUGS-W5 | 1 | 1 | 1.0 |
| h4mh2 | S14Q2 HOW OFTEN FEEL ISOLATED-W4 | 3 | 3 | 3.0 |
| h4ds5 | S21Q5 12 MO,OFT SELL DRUGS-W4 | 5 | 2 | 3.5 |
| h3to49 | S28Q49 SINCE JUN95 HAVE DRIVEN DRUNK-W3 | 2 | 9 | 5.5 |
| h4ds6 | S21Q6 12 MO,OFT STEAL SOMETHING/<$50-W4 | 7 | 5 | 6.0 |
| h1ds3 | S29Q3 LIE TO PARENTS ABOUT WHEREABOUT-W1 | 4 | 12 | 8.0 |
| hdl | hdl | 8 | 10 | 9.0 |
| h5mn3 | S13Q3 LAST MO FELT THINGS GO YOUR WAY-W5 | 14 | 8 | 11.0 |
| h4mh18 | S14Q18 PAST 7 DAYS BOTHERED BY THINGS-W4 | 9 | 15 | 12.0 |
| age_w5 | age_w5 | 6 | 20 | 13.0 |
The following plot visualizes the differences between the final-model rankings and the bootstrap rankings.
# Comparison of top_n features
lasso_joined_results %>%
compare_feature_select(sel_cols = c("overall_rank.final", "overall_rank.bootstrap"),
interactive = TRUE,
top_n = 100,
opacity = 0.50,
plot_title = "Permutation Importance of Predictors: Final vs. Bootstrap")
Here, we compare the features generated by the permutation importance between the two final models.
rf_lasso_final_joined_results <- final_rf_perm %>%
dplyr::select(-all_of(met)) %>%
full_join(dplyr::select(final_lasso_perm, -all_of(met)), by=c("predictor", "att_name"), suffix=c('.rf', '.lasso')) %>%
mutate(mean_rank = (overall_rank.rf+overall_rank.lasso)/2) %>%
arrange(mean_rank)
head(rf_lasso_final_joined_results, 20)

| predictor <chr> | att_name <chr> | overall_rank.rf <int> | overall_rank.lasso <int> | mean_rank <dbl> |
|---|---|---|---|---|
| h3to49 | S28Q49 SINCE JUN95 HAVE DRIVEN DRUNK-W3 | 4 | 2 | 3.0 |
| h4mh2 | S14Q2 HOW OFTEN FEEL ISOLATED-W4 | 5 | 3 | 4.0 |
| h5mn3 | S13Q3 LAST MO FELT THINGS GO YOUR WAY-W5 | 7 | 14 | 10.5 |
| age_w5 | age_w5 | 15 | 6 | 10.5 |
| h1ds3 | S29Q3 LIE TO PARENTS ABOUT WHEREABOUT-W1 | 24 | 4 | 14.0 |
| h4ds6 | S21Q6 12 MO,OFT STEAL SOMETHING/<$50-W4 | 22 | 7 | 14.5 |
| h4dbp | S27 DIASTOLIC BLOOD PRESSURE -W4 | 10 | 20 | 15.0 |
| h3ds5 | S26Q5 12 MO,OFT SELL DRUGS-W3 | 18 | 15 | 16.5 |
| h5ss0b | S10Q0B PAST 7 DAY DEPRESSED-W5 | 2 | 33 | 17.5 |
| h4mh18 | S14Q18 PAST 7 DAYS BOTHERED BY THINGS-W4 | 33 | 9 | 21.0 |
The following visualization provides intuition about the differences in rankings between the final model types. Variables are ordered by overall mean importance, and for a given variable, the difference in rank is shown.
# Comparison of top_n features
rf_lasso_final_joined_results %>%
compare_feature_select(sel_cols = c("overall_rank.rf", "overall_rank.lasso"),
interactive = TRUE,
top_n = 100,
opacity = 0.50,
plot_title = "Permutation Importance of Predictors: Random Forest vs Lasso")
With the final models generated, we’re now able to compare their performance metrics.
# Comparison of performance metrics
valid_perf <- get_metric_set_from_perfs(perf_list = list(final_rf_perf, final_lasso_perf)) %>%
mutate(model = c('rf', 'lasso'))
testing_perf <- get_metric_set_from_models(testing_df, list(final_model_rf, final_model_lasso), out=outcome) %>%
mutate(model = c('rf', 'lasso'))
Validation and selection. The following table shows the comparison between models on the validation set. We select our final model as the best-performing model according to the chosen metric.
print(valid_perf)

| model <chr> | accuracy <dbl> | kap <dbl> | sens <dbl> | spec <dbl> |
|---|---|---|---|---|
| rf | 0.9372973 | 0.2786264 | 0.406250 | 0.9563270 |
| lasso | 0.9372973 | 0.1947225 | 0.265625 | 0.9613662 |
Testing performance. The following shows the performance of both models on the test set. Note that although we do not use the test set to select the final model, we can still see how our selected method would have performed.
print(testing_perf)

| model <chr> | accuracy <dbl> | kap <dbl> | sens <dbl> | spec <dbl> |
|---|---|---|---|---|
| rf | 0.9416216 | 0.2202379 | 0.2903226 | 0.9642058 |
| lasso | 0.9481081 | 0.1457596 | 0.1612903 | 0.9753915 |
The following plots show a comparison between the performance of the models on the validation and test sets. Again, we don’t choose the model based on the test set, but curiosity dictates that we view this performance.
# Show plots side by side
metrics_of_interest = c('model', 'accuracy', 'bal_accuracy', 'mpce', 'sens', 'spec', 'ppv', 'npv', 'pr_auc', 'roc_auc')
valid_plt <- plot_metric_set(dplyr::select(valid_perf, all_of(metrics_of_interest)), plot_title = "Model comparison for validation set")
test_plt <- plot_metric_set(dplyr::select(testing_perf, all_of(metrics_of_interest)), plot_title = "Model comparison for testing set")
gridExtra::grid.arrange(gridExtra::arrangeGrob(valid_plt, test_plt, ncol=2, nrow=1))